Molecular Biology and Evolution
Oxford University Press (OUP)
All preprints, ranked by how well they match Molecular Biology and Evolution's content profile, based on 488 papers previously published here. The average preprint has a 0.21% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Natsidis, P.; Kapli, P.; Schiffer, P. H.; Telford, M. J.
The availability of complete sets of genes from many organisms makes it possible to identify genes unique to (or lost from) certain clades. This information is used to reconstruct phylogenetic trees; to identify genes involved in the evolution of clade specific novelties; and for phylostratigraphy - identifying ages of genes in a given species. These investigations rely on accurately predicted orthologs. Here we use simulation to produce sets of orthologs which experience no gains or losses. We show that errors in identifying orthologs increase with higher rates of evolution. We use the predicted sets of orthologs, with errors, to reconstruct phylogenetic trees; to count gains and losses; and for phylostratigraphy. Our simulated data, containing information only from errors in orthology prediction, closely recapitulate findings from empirical data. We suggest published downstream analyses must be informed to a large extent by errors in orthology prediction which mimic expected patterns of gene evolution.
Koers, C.; Bierman, R. F.; Xu, H.; Akey, J. M.
The ratio of nonsynonymous (dN) to synonymous (dS) substitutions in protein-coding genes is a fundamental metric in molecular evolution, used to test hypotheses about the relative contributions of genetic drift and natural selection in shaping patterns of protein divergence (Williams et al., 2020). However, interpretation of dN/dS ratios may be confounded by sequence context and specific substitution models (Hughes, 2007; Kryazhimskiy & Plotkin, 2008). We present MutagenesisForge, a modular command-line tool and Python package for simulating codon-level mutagenesis and calculating dN/dS under user-specified conditions. At its core is the MutationModel interface, which supports specific substitution matrices and ensures consistency across both Exhaustive and Contextual modes of simulation. These modes allow users to test evolutionary hypotheses or to generate null distributions of dN/dS across a range of biologically relevant models. As large-scale DNA sequencing data sets continue to be generated both within and between species, MutagenesisForge offers a flexible platform for evolutionary analysis and hypothesis testing of mutational processes in protein-coding genes.
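To make the idea of exhaustive codon-level mutagenesis concrete, the sketch below enumerates every single-base change in a coding sequence and classifies each as synonymous or nonsynonymous under the standard genetic code. It is a minimal illustration only and does not use or reproduce the MutagenesisForge API: the function name `exhaustive_counts`, the uniform treatment of all base changes, and the choice to count stop-codon changes as nonsynonymous are assumptions made here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (first base slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def exhaustive_counts(cds):
    """Return (nonsynonymous, synonymous) counts over all single-base changes.

    Hypothetical helper: every one of the 9 possible point mutations per codon
    is weighted equally, and changes to or from a stop codon are counted as
    nonsynonymous.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    nonsyn = syn = 0
    for start in range(0, len(cds), 3):
        codon = cds[start:start + 3]
        original = CODON_TABLE[codon]
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a mutation
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == original:
                    syn += 1
                else:
                    nonsyn += 1
    return nonsyn, syn
```

For example, `exhaustive_counts("ATGGGG")` returns `(15, 3)`: all nine changes to ATG alter the amino acid, while the three third-position changes to GGG are synonymous. These are raw counts of mutational outcomes, not a dN/dS estimate, which would additionally normalize by the numbers of synonymous and nonsynonymous sites and by a substitution model.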
Dighe, A.; Maziarz, J.; IbrahimHashim, A.; Gatenby, R.; Kshitiz; Levchenko, A.; Wagner, G. P.
Changes in transcriptional gene expression are a dominant mode of evolution, mostly driven by mutations at cis-regulatory regions. Mutations can affect gene expression in multiple cell types if the same cis-regulatory elements are used by different cell types. As a consequence, changes in gene expression in one cell type may be associated with similar gene expression changes in another cell type. Correlated gene expression change can explain correlated character evolution, as for instance the correlation between placental invasion and vulnerability to cancer malignancy. Here we test this hypothesis using a comparative and an experimental data set. Specifically, we investigate gene expression in dermal skin fibroblasts (SF) and uterine endometrial stromal fibroblasts (ESF). The comparative dataset consists of transcriptomes from cultured SF and ESF from 9 mammalian species. We calculated the phylogenetic independent contrasts (PIC) for each gene and cell type. We find that evolutionary changes in gene expression in SF and ESF are highly correlated, supporting the hypothesis that correlated gene expression changes are a prevalent feature of gene expression evolution. The experimental data set derives from a SCID mouse strain that was selected for slow cancer growth, which led to substantial changes in the SF compared to wild type SCID mice. We isolated SF and ESF from wild type and evolved SCID mice and compared their gene expression profiles. We find a significant correlation between the gene expression contrasts of SF and ESF, which supports the hypothesis that gene expression variation in SF and ESF is correlated. We discuss the implications of these findings for the hypothesized correlation between placental invasiveness and vulnerability to metastatic cancer.
Bricout, R.; Weil, D.; Stroebel, D.; Genovesio, A.; Roest Crollius, H.
Amino acids evolve at different speeds within protein sequences, because their functional and structural roles are different. However, the position of an amino acid within the sequence is not known to influence this evolutionary speed. Here we discovered that amino acids evolve almost twice as fast at protein termini as in their centre, hinting at a strong topological bias along the sequence length. We further show that the distribution of functional domains and of solvent-accessible residues in proteins readily explains how functional constraints are weaker at their termini, leading to the observed excess of amino acid substitutions. Finally, we show that methods inferring sites under positive selection are strongly biased towards protein termini, suggesting that they may confound positive selection with weak negative selection. These results suggest that accounting for positional information should improve evolutionary models.
Peede, D.; Ortega-Del Vecchyo, D.; Huerta-Sanchez, E.
The past decade has ushered in a resurgence of studies highlighting the importance of introgression throughout the Tree of Life. Several methods exist for detecting and quantifying introgression on a genomic scale, yet the majority of these methods primarily utilize signals of derived allele sharing between donor and recipient populations. In this study, we exploit the fact that introgression will not only result in derived allele sharing but also the reintroduction of ancestral alleles to derive new estimators of the admixture proportion. Using coalescent simulations, we assess the performance of our new methods and the methods proposed in Lopez Fang et al. 2022 to assess the utility of incorporating shared ancestral variation into genome-wide inferences of introgression. Using coalescent theory, simulations, and applying our methods to human and canid data, we find that methods incorporating ancestral allele sharing are comparable to their derived allele sharing counterparts, in turn providing researchers with the opportunity to utilize more of the genomic signature of introgression.
Montoya, P.; Fabre, A.-C.; Goswami, A.; Morlon, H.; Clavel, J.
Multivariate phylogenetic comparative methods for modelling high-dimensional traits such as 3D shapes or gene expression profiles have been recently developed. However, these approaches are impractical and almost impossible to use when the number of traits exceeds a few thousand, as they become computationally prohibitive. We overcome these limitations by proposing a new maximum likelihood approach based on the Empirical Bayes framework. This approach takes into account the information of the complete covariances (among species and traits) to infer parameters and compare models of trait evolution for high-dimensional datasets. Through simulations, we demonstrate that the proposed approach can accurately estimate parameters of various trait evolution models, even when the number of traits is ten times larger than the number of lineages; it requires less memory and is at least 10 times faster than currently available approaches. This fast, efficient framework enabled us to extend the high-dimensional multivariate phylogenetic comparative toolkit by including an Ornstein-Uhlenbeck process with multiple optima to study adaptation to various selective regimes. Applying our approach to the evolution of jaw morphology in relation to dietary adaptation in mammals, we demonstrate morphological convergence in carnivorous and herbivorous lineages. The proposed Empirical Bayes framework, implemented in the R package mvMORPH, enables phylogenetic comparative methods to efficiently handle high-dimensional datasets and complex models of trait evolution.
Peterson, K. J.; Clarke, A.; Zolotarov, G.; Deline, B.; McPeek, M.; Martinez, P.; Fromm, B.
A prevailing problem in evolutionary biology is elucidating the "genotype-phenotype map" that characterizes how genomic activities regulate different aspects of organismal morphology and their variability in both space and time. Here, we explore potential causality between genome content and both morphological complexity and disparity by compiling the regulatory components (i.e., transcription factors, RNA binding proteins, and microRNA families) as well as a representative set of non-regulatory "housekeeping genes" in 32 species belonging to a wide variety of animal phyla, altogether encapsulating a number of varying genomic characteristics and morphological diversities. A principal component analysis of these four non-overlapping genomic components from each of these 32 species in relation to their last common ancestor revealed that no relationship exists between genome space and disparity, as changes to animal body plans appear to be largely the result of changes to the gene regulatory networks that govern animal development rather than gaining or losing specific sets of regulatory genes. However, using both phylogenetically correlated as well as phylogenetically uncorrelated statistical tests, we find a strong relationship between the loss of all considered gene types and the advent of some parasitic taxa, as well as between microRNA innovations and organismal complexity. While this analysis of genomic features suggests how complexity and disparity are each encoded in the genome, further analysis of the regulatory networks in which they participate should provide a more comprehensive description of how organisms diversify their morphologies over time through alterations in their genomic components.
SIGNIFICANCE STATEMENT: Variation in morphological form can be measured by considering either disparity (i.e., the amount of variance in morphological form) or complexity (i.e., the difference in the number of their "parts" including genes or cells). To date, the genomic basis of either has remained largely unknown. Here, we compile the regulatory genome of the last common ancestor of bilaterian metazoans, including its transcription factors, RNA-binding proteins and microRNA families, and ask how changes to these repertoires might reflect changes in either disparity or complexity. We find that although there is no relationship between the regulatory genome and disparity, there is a robust relationship between the loss of regulatory genes relative to housekeeping genes, especially in parasitic taxa. Further, there is a strong relationship between increases to microRNAs and complexity, which likely reflects the unique role microRNAs play in increasing both the accuracy and the precision of gene expression during development.
Braichenko, S.; Borges, R.; Kosiol, C.
The role of balancing selection is a long-standing evolutionary puzzle. Balancing selection is a crucial evolutionary process that maintains genetic variation (polymorphism) over extended periods of time; however, detecting it poses a significant challenge. Building upon the polymorphism-aware phylogenetic models (PoMos) framework rooted in the Moran model, we introduce the PoMoBalance model. This novel approach is designed to disentangle the interplay of mutation, genetic drift, directional selection (GC-biased gene conversion), along with the previously unexplored balancing selection pressures on ultra-long timescales comparable with species divergence times by analysing multi-individual genomic and phylogenetic divergence data. Implemented in the open-source RevBayes Bayesian framework, PoMoBalance offers a versatile tool for inferring phylogenetic trees as well as quantifying various selective pressures. The novel aspect of our approach in studying balancing selection lies in the ability of PoMos to account for ancestral polymorphisms and incorporate parameters that measure frequency-dependent selection, allowing us to determine the strength of the effect and exact frequencies under selection. We implemented validation tests and assessed the model on the data simulated with SLiM and a custom Moran model simulator. Real sequence analysis of Drosophila populations reveals insights into the evolutionary dynamics of regions subject to frequency-dependent balancing selection, particularly in the context of sex-limited colour dimorphism in Drosophila erecta.
Darragh, A. C.; Rifkin, S. A.
Transcription factors are defined by their DNA-binding domains (DBDs). The binding affinities and specificities of a transcription factor to its DNA binding sites can be used by an organism to fine-tune gene regulation and so are targets for evolution. Here we investigate the evolution of GATA-type transcription factors (GATA factors) in the Caenorhabditis genus. Based upon comparisons of their DBDs, these proteins form 13 distinct groups. This protein family experienced a burst of gene duplication in several of these groups along two short branches in the species tree, giving rise to subclades with very distinct complements of GATA factors. By comparing extant gene structures, DBD sequences, genome locations, and selection pressures we reconstructed how these duplications occurred. Although the paralogs have diverged in various ways, the literature shows that at least eight of the DBD groups bind to similar G-A-T-A DNA sequences. Thus, despite gene duplications and divergence among DBD sequences, most Caenorhabditis GATA factors appear to have maintained similar binding preferences, which could create the opportunity for developmental system drift. We hypothesize that this limited divergence in binding specificities contributes to the apparent disconnect between the extensive genomic evolution that has occurred in this genus and the absence of significant anatomical changes.
Williams, T. A.; Davin, A. A.; Morel, B.; Szantho, L. L.; Spang, A.; Stamatakis, A.; Hugenholtz, P.; Szollosi, G. J.
Species tree-aware phylogenetic methods model how gene trees are generated along the species tree by a series of evolutionary events, including the duplication, transfer and loss of genes. Over the past ten years these methods have emerged as a powerful tool for inferring and rooting gene and species trees, inferring ancestral gene repertoires, and studying the processes of gene and genome evolution. However, these methods are complex and can be more difficult to use than traditional phylogenetic approaches. Method development is rapid, and it can be difficult to decide between approaches and interpret results. Here, we review ALE and GeneRax, two popular packages for reconciling gene and species trees, explaining how they work, how results can be interpreted, and providing a tutorial for practical analysis. It was recently suggested that reconciliation-based estimates of duplication and transfer frequencies are unreliable. We evaluate this criticism and find that, provided parameters are estimated from the data rather than being fixed based on prior assumptions, reconciliation-based inferences are in good agreement with the literature, recovering variation in gene duplication and transfer frequencies across lineages consistent with the known biology of studied clades. For example, published datasets support the view that transfers greatly outnumber duplications in most prokaryotic lineages. We conclude by discussing some limitations of current models and prospects for future progress.
Significance statement: Evolutionary trees provide a framework for understanding the history of life and organising biodiversity. In this review, we discuss some recent progress on statistical methods that allow us to combine information from many different genes within the framework of an overarching phylogenetic species tree. We review the advantages and uses of these methods and discuss case studies where they have been used to resolve deep branches within the tree of life. We conclude with the limitations of current methods and suggest how they might be overcome in the future.
Booker, T. R.; Yeaman, S.; Whitlock, M. C.
Adaptation occurring in similar genes or genomic regions in distinct lineages provides evolutionary biologists with a glimpse at the fundamental opportunities for and constraints to diversification. With the widespread availability of high throughput sequencing technologies and the development of population genetic methods to identify the genetic basis of adaptation, studies have begun to compare the evidence for adaptation at the molecular level among distinct lineages. However, methods to study repeated adaptation are often oriented towards genome-wide testing to identify a set of genes with signatures of repeated use, rather than evaluating the significance at the level of an individual gene. In this study, we propose PicMin, a novel statistical method derived from the theory of order statistics that can test for repeated molecular evolution to estimate significance at the level of an individual gene, using the results of genome scans. This method is generalizable to any number of lineages and indeed, statistical power to detect repeated adaptation increases with the number of lineages that have signals of repeated adaptation of a given gene in multiple lineages. An implementation of the method written for R can be downloaded from https://github.com/TBooker/PicMin.
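As background for the order-statistics idea mentioned in the abstract above, the sketch below computes the null distribution of the k-th smallest of n independent uniform p-values, a standard result (the k-th order statistic follows a Beta(k, n-k+1) distribution). This is not the PicMin procedure itself, whose details are not given here; the function name `order_statistic_cdf` and the assumption of independent, uniformly distributed p-values across lineages are illustrative choices.

```python
from math import comb


def order_statistic_cdf(x, k, n):
    """P(k-th smallest of n independent Uniform(0, 1) p-values <= x).

    Hypothetical helper: the event holds when at least k of the n p-values
    fall at or below x, so the probability is a binomial tail sum.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))
```

With k = 1 this reduces to the familiar 1 - (1 - x)^n for the minimum p-value, which shows why a small p-value in one of many lineages is weak evidence on its own, whereas small values at higher ranks (several lineages at once) are much less likely under the null.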
McCarthy, C. G. P.; Susko, E.; Roger, A. J.
Phylogenetic trees are often inferred from protein sequences sampled from diverse taxa across the tree of life. The compositions of these amino acid sequences may be heterogeneous across both sites and branches, particularly if deep phylogenetic divergences are the focus. Under some conditions, failure to model this compositional heterogeneity can lead to phylogenetic artefacts. However, the computational cost of phylogenetic inference with models accounting for compositional heterogeneity can be prohibitive. The originally proposed site-and-branch-heterogeneous GFmix model accounts for changing relative frequencies of G, A, R, and P (GARP) vs. F, Y, M, I, N, and K (FYMINK) amino acids resulting from extreme variation in G+C content among taxa. This GFmix model modifies a fitted site-heterogeneous profile mixture model in a branch-specific manner using parameters that reflect branch-specific amino acid compositions. This approach has been shown to improve likelihoods and reduce compositional artifacts. However, the original implementation of the model includes constraints which may sacrifice accuracy for computability and is limited to modeling variation in GARP/FYMINK composition. Here we investigate the properties of the original GFmix model in greater depth and present several improvements to the model. The improved GFmix models permit fewer constraints on branch-specific composition parameters, allow modeling of user-defined compositional heterogeneity, and provide for full maximum-likelihood optimization of parameters. We have also developed new methods for detecting compositional heterogeneity directly from sequence data. Analyses of simulated site-and-branch-heterogeneous data indicate that the improved GFmix models better estimate branch-specific compositions and branch lengths in heterogeneous trees. We applied the various versions of the GFmix model to a real dataset with known compositional heterogeneity artefacts. We find that the most complex GFmix model with full maximum likelihood parameter optimization consistently supports the correct tree over the artefactual tree with improved likelihoods. All versions of the GFmix model are available from https://www.mathstat.dal.ca/~tsusko/software.html.
Otto, M.; Wiehe, T.
Gene duplication plays a crucial role in the adaptive evolution and diversification of organisms by creating extra copies of genes that can evolve new functions while preserving the original. Duplicated genes can become fixed in populations or appear as copy number variants. However, inferring and dating these duplication events from present-day data is challenging, as gene copy count distributions could result from either a few ancient duplication events or many recent ones. Sequence-based phylogenetic reconstruction, a common practice, does not include the history of individuals and hence may result in inconsistencies, which may lead to misinterpretations. Here, we introduce a novel model for inferring gene copy number evolution, which describes gene duplications and their evolution over time through a random walk on a coalescent duplication network. This approach is solely based on copy number counts and hence independent of the inconsistencies of sequence-based inferences. Backward in time we implement structured coalescent simulations, where we reinterpret structure as genealogical distance based on copy number counts. We apply this model to the NB-ARC domain counts of NLR genes in A. thaliana to infer the number and times of duplication events that have led to the present-day copy number distribution.
Biba, D.; Klink, G. V.; Bazykin, G. A.
Insertions and deletions of lengths not divisible by 3 in protein-coding sequences cause frameshifts that usually induce premature stop codons and may carry a high fitness cost. However, this cost can be partially offset by a second compensatory indel restoring the reading frame. The role of such pairs of compensatory frameshifting mutations (pCFMs) in evolution has not been studied systematically. Here, we use whole-genome alignments of protein coding genes of 100 vertebrate species, and of 122 insect species, studying the prevalence of pCFMs in their divergence. We detect a total of 619 candidate pCFM-genes; 11 of them pass stringent quality filtering, including three human genes: RAB36, ARHGAP6 and NCR3LG1. In some instances, amino acid substitutions closely predating or following pCFMs restored the biochemical similarity of the frameshifted segment to the ancestral amino acid sequence, possibly reducing or negating the fitness cost of the pCFM. Typically, however, the resulting sequence bore no biochemical similarity to the ancestral one, indicating that pCFMs can uncover radically novel regions of protein space. In total, pCFMs represent an appreciable and previously overlooked source of novel variation in amino acid sequences.
Sethuraman, A.; Sousa, V.; Hey, J.
Demographic changes such as fluctuating population size and differential introgression can mask the effects of natural selection, and affect rates of genome evolution, local adaptation, reproductive isolation, and eventual speciation. Besides identifying differentially introgressing genes (and genomic regions) that are "labeled" to be retroactively causal to adaptive evolution and speciation, there is significant impetus to understand, and perhaps estimate, the underlying demography that affects current genomic diversity. Using model-based likelihood methods to directly estimate and decouple the effects of differential introgression and demography across genomic loci offers an ideal solution to detect differential introgression and population demography, and to build hypotheses around the underlying evolutionary processes. We describe a computationally efficient parallelized implementation of mixture-model based isolation with migration (IM) analyses to assign loci to classes based on their shared coalescent histories (population sizes, or migration rates). We apply this method to several genomic data sets (great apes - chimpanzees and bonobos, Anopheles mosquitoes, threespine sticklebacks, Mullerian mimics of Heliconius butterflies, mice, European rabbits, and fruit flies) that have been previously characterized (perhaps erroneously) using genome-wide scans of differentiation. We show that we cannot reject a model of differential introgression, or linked selection across a majority of species analyzed, with two species showing the combined effects of differential introgression and linked natural selection across multiple, non-independent genomic loci.
Aube, S.; Nielly-Thibault, L.; Landry, C. R.
How changes in the different steps of protein synthesis - transcription, translation and degradation - contribute to differences of protein abundance among genes is not fully understood. There is however accumulating evidence that transcriptional divergence might have a prominent role. Here, we show that yeast paralogous genes are more divergent in transcription than in translation. We explore two causal mechanisms for this predominance of transcriptional divergence: an evolutionary trade-off between the precision and economy of gene expression and a larger mutational target size for transcription. Performing simulations within a minimal model of post-duplication evolution, we find that both mechanisms are consistent with the observed divergence patterns. We also investigate how additional properties of the effects of mutations on gene expression, such as their asymmetry and correlation across levels of regulation, can shape the evolution of duplicates. Our results highlight the importance of fully characterizing the distributions of mutational effects on transcription and translation. They also show how general trade-offs in cellular processes and mutation bias can have far-reaching evolutionary impacts.
Cooper, J. C.; Leonard, C. J.; Pedersen, B. S.; Carey, C. M.; Quinlan, A. R.; Elde, N. C.; Phadnis, N.
Recurrent positive selection at the codon level is often a sign that a gene is engaged in a molecular arms race - a conflict between the genome of its host and the genome of another species over mutually exclusive access to a resource that has a direct effect on the fitness of both individuals. Detecting molecular arms races has led to a better understanding of how evolution changes the molecular interfaces of proteins when organisms compete over time, especially in the realm of host-pathogen interactions. Here, we present a method for detection of gene-level recurrent positive selection across entire genomes for a given phylogenetic group. We deploy this method on five mammalian clades - primates, mice, deer mice, dogs, and bats - both to detect novel instances of recurrent positive selection and to compare the prevalence of recurrent positive selection between clades. We analyze the frequency at which individual genes are targets of recurrent positive selection in multiple clades. We find that coincidence of selection occurs far more frequently than expected by chance, indicating that all clades experience shared selective pressures. Additionally, we highlight Polymeric Immunoglobulin Receptor (PIGR) as a gene which shares specific amino acids under recurrent positive selection in multiple clades, indicating that it has been locked in a molecular arms race for ~100 My. These data provide an in-depth comparison of recurrent positive selection across the mammalian phylogeny, and highlight the power of comparative evolutionary approaches to generate specific hypotheses about the molecular interactions of rapidly evolving genes.
Moutinho, A. F.; Eyre-Walker, A.
Bias in synonymous codon usage has been reported across all kingdoms of life. Evidence across species suggests that codon usage bias is often driven by selective pressures, typically for translational efficiency. These selective pressures have been shown to depress the rate at which synonymous sites evolve. We hypothesise that selection on synonymous codon use could also slow the rate of protein evolution if two amino acids have different preferred codons. We test this hypothesis by looking at patterns of protein evolution using polymorphism and substitution data in bacteria. We found that non-synonymous mutations that change from unpreferred to preferred codons are more common than the opposite, but only amongst codons that vary substantially in their preference level. Overall, selection on codon bias seems to have little influence over non-synonymous polymorphism or substitution patterns.
Schultz, D. T.; Heath-Heckman, E. A. C.; Winchell, C. J.; Kuo, D.-H. T.; Yu, Y.-s.; Oberauer, F.; Kocot, K.; Cho, S.-J.; Simakov, O.; Weisblat, D. A.
Comparisons of multiple metazoan genomes have revealed the existence of ancestral linkage groups (ALGs), genomic scaffolds sharing sets of orthologous genes that have been inherited from ancestral animals for hundreds of millions of years (Simakov et al. 2022; Schultz et al. 2023). These ALGs have persisted across major animal taxa including Cnidaria, Deuterostomia, Ecdysozoa and Spiralia. Notwithstanding this general trend of chromosome-scale conservation, ALGs have been obliterated by extensive genome rearrangements in certain groups, most notably including Clitellata (oligochaetes and leeches), a group of easily overlooked invertebrates that is of tremendous ecological, agricultural and economic importance (Charles 2019; Barrett 2016). To further investigate these rearrangements, we have undertaken a comparison of 12 clitellate genomes (including four newly sequenced species) and 11 outgroup representatives. We show that these rearrangements began at the base of the Clitellata (rather than progressing gradually throughout polychaete annelids), that the inter-chromosomal rearrangements continue in several clitellate lineages and that these events have substantially shaped the evolution of the otherwise highly conserved Hox cluster.
Norn, C.; Andre, I.; Theobald, D. L.
Proteins evolve under a myriad of biophysical selection pressures that collectively control the patterns of amino acid substitutions. Averaged over time and across proteins, these evolutionary pressures are sufficiently consistent to produce global substitution patterns that can be used to successfully find homologues, infer phylogenies, and reconstruct ancestral sequences. Although the factors which govern the variation of protein substitution rates have received much attention, the influence of thermodynamic stability constraints remains unresolved. Here we develop a simple model to calculate amino acid rate matrices from evolutionary dynamics controlled by a fitness function that reports on the thermodynamic effects of amino acid mutations in protein structures. This hybrid biophysical and evolutionary model accounts for nucleotide transition/transversion rate bias, multi-nucleotide codon changes, the number of codons per amino acid, and thermodynamic protein stability. We find that our theoretical model accurately recapitulates the complex pattern of empirical rates observed in common global amino acid substitution matrices used in phylogenetics. These results suggest that selection for thermodynamically stable proteins, coupled with nucleotide mutation bias filtered by the structure of the genetic code, is the primary global driver behind the amino acid substitution patterns observed in proteins throughout the tree of life.